{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Introductions into using \"statsmodels\" and \"Pandas\"\n",
    "\n",
    "*pandas* has quickly become almost a standard for working with structured data in Python. It often makes code much clearer to\n",
    "read, and it also offers powerful tools for simple import and export of data.\n",
    "\n",
    "*statsmodels* is an advanced package for statistical modeling with Python. Here we will only touch the surface of its extensive functionality. A more extensive introduction is available under\n",
    "http://nbviewer.ipython.org/gist/vincentarelbundock/3485014\n",
    "\n",
    "Author : Thomas Haslwanter, Feb-2017"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# Here I import numpy explicitly, so as to make clear where each function comes from\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import stats\n",
    "import statsmodels.formula.api as sm\n",
    "import sys\n",
    "\n",
    "# \"urlopen\" is in a different locations in Python2 and Python3\n",
    "if sys.version_info[0] == 3:\n",
    "    from urllib.request import urlopen\n",
    "else:\n",
    "    from urllib import urlopen\n",
    "    \n",
    "# Show plots inline\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Example: Linear regression fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                      y   R-squared:                       0.995\n",
      "Model:                            OLS   Adj. R-squared:                  0.995\n",
      "Method:                 Least Squares   F-statistic:                 1.819e+04\n",
      "Date:                Tue, 14 Apr 2020   Prob (F-statistic):          4.31e-113\n",
      "Time:                        22:11:46   Log-Likelihood:                -147.60\n",
      "No. Observations:                 100   AIC:                             299.2\n",
      "Df Residuals:                      98   BIC:                             304.4\n",
      "Df Model:                           1                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "Intercept    -19.9742      0.212    -94.083      0.000     -20.396     -19.553\n",
      "x              0.4996      0.004    134.857      0.000       0.492       0.507\n",
      "==============================================================================\n",
      "Omnibus:                        1.158   Durbin-Watson:                   2.037\n",
      "Prob(Omnibus):                  0.560   Jarque-Bera (JB):                1.038\n",
      "Skew:                          -0.030   Prob(JB):                        0.595\n",
      "Kurtosis:                       2.504   Cond. No.                         114.\n",
      "==============================================================================\n",
      "\n",
      "Warnings:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ],
   "source": [
    "# To get reproducable values, I provide a seed value\n",
    "np.random.seed(987654321)   \n",
    "\n",
    "# Generate a noisy line\n",
    "x = np.arange(100)\n",
    "y = 0.5*x - 20 + np.random.randn(len(x))\n",
    "df = pd.DataFrame({'x':x, 'y':y})\n",
    "\n",
    "# Fit a linear model ...\n",
    "# Note the formula language used to denote the relationship between \"x\" and \"y\"\n",
    "model = sm.ols('y~x', data=df).fit()\n",
    "\n",
    "# ... and print an extensive summary of the fit results and model properties\n",
    "print((model.summary()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Example from Altman \"Practical statistics for medical research\n",
    "\n",
    "Here I just show how to bring data into the *pandas* format, and how to use its object oriented notation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# Get the data from the web\n",
    "inFile = 'altman_94.txt'\n",
    "url_base = 'https://raw.githubusercontent.com/thomas-haslwanter/statsintro_python/master/ipynb/Data/data_altman/'\n",
    "url = url_base + inFile\n",
    "data = np.genfromtxt(urlopen(url), delimiter=',')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# Bring them into pandas format\n",
    "lean = pd.Series(data[data[:,1]==1,0])\n",
    "obese = pd.Series(data[data[:,1]==0,0])\n",
    "\n",
    "df = pd.DataFrame({'lean':lean, 'obese':obese})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "lean      8.066154\n",
      "obese    10.297778\n",
      "dtype: float64\n",
      "There is a significant difference: p = 0.0007989982111700593\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD5CAYAAAA+0W6bAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADt0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjByYzIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy/EUOrgAAAQWklEQVR4nO3df4zkd13H8eebuyrtLRRoy9gg9DRpysJiJZ0YKkvdzdGmoZVS1MgmxAqrGxJyBQ3KkZU0hCwWNVFMo3Jxz56B7EWRounFow3Z4Vxt0T2g5cryI2qBUuRAoGXaGu6Ot3/sLFz3dm92Zr67s5/d5yOZ7Mxnvt/5vnf6uVe/+/l+v59vZCaSpPI8o98FSJK6Y4BLUqEMcEkqlAEuSYUywCWpUAa4JBVqZ7sFIuIAcCNwIjOHWm3vBW4CfgicAH4zMx9t91kXX3xx7t69u6eC9WNPPPEEu3bt6ncZ0lnsm9U6duzYtzPzkuXt0e488Ii4BmgCf3tGgD87Mx9vPb8VeElmvqVdEfV6Pefn57upXytoNBqMjIz0uwzpLPbNakXEscysL29vO4SSmUeB7yxre/yMl7sArwaSpA3WdghlNRExBfwG8BgwWllFkqQ1aTuEAhARu4G7l4ZQlr33LuCZmXnbKutOABMAtVrtqkOHDvVSr87QbDYZGBjodxnSWeyb1RodHV1xCKWKAL8MOLzSe8s5Bl4txxm1Wdk3q9X1GPgqH3b5GS9fC3yh28IkSd1Zy2mEM8AIcHFEPALcBrwmIq5g8TTCrwBtz0CRJFWrbYBn5tgKzdPrUIskqQNeiSlJher6NEJJioiO1/EmMtVxD1xS1zJzxcdl77x71fdUHQNckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCgDXJIKZYBLUqEMcEkqlAEuSYUywCWpUAa4JBXKAJekQhngklQoA1ySCmWAS1Kh2gZ4RByIiBMRcfyMtj+OiC9ExIMRcVdEPGd9y5QkLbeWPfA7geuXtd0LDGXmzwFfAt5VcV2SpDbaBnhmHgW+s6ztnsw81Xp5P/DT61CbJOkcqhgDfzPwzxV8jiSpAzt7WTkiJoFTwIfPscwEMAFQq9VoNBq9bFJnaDabfp/atOyb66/rAI+IW4AbgT2Zmastl5n7gf0A9Xo9R0ZGut2klmk0Gvh9alM6cti+uQG6CvCIuB54J/BLmflktSVJktZiLacRzgD3AVdExCMRMQ7cATwLuDciPhsRf7XOdUqSlmm7B56ZYys0T69DLZKkDnglpiQVygCXpEIZ4JJUKANckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFaqn2QglbX1XvuceHnvqZMfr7d53uKPlLzz/PB647bqOt7OdGeCSzumxp07y8O03dLRONzNldhr4cghFkoplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCgDXJIKZYBLUqHaBnhEHIiIExFx/Iy2X4uIhyLihxFRX98SJUkrWcse+J3A9cvajgOvB45WXZAkaW3aTmaVmUcjYveytgWAiFifqiRJba37bIQRMQFMANRqNRqNxnpvcttoNpt+n9oQnfazbvum/bkz6x7gmbkf2A9Qr9ez0ykmdbaZmRmmpqZYWFhgcHCQyclJxsbG+l2WtqojhzueGrab6WS72c5253zghZmZmWFycpLp6WlOnz7Njh07GB8fBzDEpW3G0wgLMzU1xfT0NKOjo+zcuZPR0VGmp6eZmprqd2mSNthaTiOcAe4DroiIRyJiPCJujohHgKuBwxHx8fUuVIsWFhYYHh5+Wtvw8DALCwt9qkhSv6zlLJTV/i6/q+JatAaDg4PMzc0xOjr6o7a5uTkGBwf7WJWkfnAIpTCTk5OMj48zOzvLqVOnmJ2dZXx8nMnJyX6XJmmDeRCzMEsHKvfu3fujs1CmpqY8gCltQwZ4gcbGxhgbG+vuVC1JW4ZDKJJUKANckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCgDXJIKZYBLUqEMcEkqlAEuSYUywCWpUAZ4gWZmZhgaGmLPnj0MDQ0xMzPT75Ik9YHzgRfGu9JLWuIeeGG8K72kJWu5K/2BiDgREcfPaHteRNwbEV9u/Xzu+papJd6VXtKSteyB3wlcv6xtH/CJzLwc+ETrtTbA0l3pz+Rd6aXtqW2AZ+ZR4DvLmm8CDraeHwReV3FdWoV3pZe0pNuDmLXM/AZAZn4jIp5fYU06B+9KL2nJup+FEhETwARArVaj0Wis9ya3vEsvvZQ77riDZrPJwMAAgN+r1lWn/avZbHbVJ+3Hnek2wL8ZEZe29r4vBU6stmBm7gf2A9Tr9RwZGelyk1qu0Wjg96l1d+Rwx/2sq77ZxXa2u25PI/wn4JbW81uAf6ymHEnSWq3lNMIZ4D7gioh4JCLGgduBayPiy8C1rdeSpA3UdgglM1c7Oran4lokSR3wSkxJKpQBLkmFMsAlqVAGuCQVyulkJZ3Tswb38bKDXUx3dLD9Ik/fDsANnW9nGzPAJZ3T9xdu5+HbOwvWbi7k2b3vcEfLyyEUSSqWAS5JhTLAJalQBrgkFcoAl6RCGeCSVCgDXJIKZYBLUqEMcEkqlAEuSYUywCWpUAa4JBXKAJekQjkboaS2upop8Ehn61x4/nmdb2ObM8AlnVOnU8nCYuB3s54609MQSkS8LSKOR8RDEfH2qoqSJLXXdYBHxBDw28AvAFcCN0bE5VUVJkk6t172wAeB+zPzycw8BXwSuLmasiRJ7fQS4MeBayLiooi4AHgN8MJqypIktdP1QczMXIiI9wP3Ak3gAeDU8uUiYgKYAKjVajQajW43qWWazabfpzYt++b6i8ys5oMi3gc8kpl/sdoy9Xo95+fnK9meurtxrLQRPAulWhFxLDPry9t7Oo0wIp6fmSci4kXA64Gre/k8SdLa9Xoe+D9ExEXASeCtmfndCmqSJK1BTwGema+qqhBJUmecC0WSCuWl9JK6FhGrv/f+ldurOnFC7oFL6kFmrviYnZ1d9T1VxwCXpEIZ4JJUKANckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCjnQtnkzjXXxLl4ybK09bkHvsmtNp9EZnLZO+92vglpGzPAJalQBrgkFcoAl6RCGeCSVCgDXJIKZYBLUqF6Og88In4H+C0ggc8Bb8rM/6uisO3myvfcw2NPnex4vd37Dq952QvPP48Hbruu421I2py6DvCIeAFwK/CSzHwqIv4OeANwZ0W1bSuPPXWSh2+/oaN1Go0GIyMja16+k7CXtPn1OoSyEzg/InYCFwCP9l6SJGktut4Dz8yvR8SfAF8FngLuycx7li8XERPABECtVqPRaHS7yS2v0++m2Wx2vI7fvzZCN31TnetlCOW5wE3AzwDfA/4+It6YmR86c7nM3A/sB6jX69nJn/zbypHDHQ2HQOdDKN1sQ+pGx31TXellCOXVwH9n5rcy8yTwUeAXqylLktROL2ehfBV4RURcwOIQyh5gvpKqtqFnDe7jZQf3db7iwU62AdDZgVJJm1cvY+CfioiPAJ8GTgGfoTVUos59f+F2z0KR1JGezgPPzNuA2yqqRZLUAa/ElKRCGeCSVCgDXJIKZYBLUqEMcEkqlAEuSYUywCWpUAa4JBXKAJekQhngklSoni6lV7W6mqvkSGe3VJO0dRjgm0SnE1nBYuB3s56krcEhFEkqlAEuSYUywCWpUAa4JBXKAJekQhngklQoA1ySCmWAS1Khug7wiLgiIj57xuPxiHh7lcVJklbX9ZWYmflF4OcBImIH8HXgrorqkiS1UdUQyh7gPzPzKxV9niSpjarmQnkDMFPRZ+kMEXHu99+/cntmrkM1kjaT6PUfekT8BPAo8NLM/OYK708AEwC1Wu2qQ4cO9bQ9/Viz2WRgYKDfZUhnsW9Wa3R09Fhm1pe3VxHgNwFvzczr2i1br9dzfn6+p+3pxxqNBiMjI/0uQzqLfbNaEbFigFcxBj6GwyeStOF6CvCIuAC4FvhoNeVIktaqp4OYmfkkcFFFtUiSOuCVmJJUKANckgplgEtSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCgDXJIKZYBLUqEMcEkqlAEuSYUywCWpUAa4JBXKAJekQhngklQoA1ySCmWAF2hmZoahoSH27NnD0NAQMzPeU1rajnq6J6Y23szMDJOTk0xPT3P69Gl27NjB+Pg4AGNjY32uTtJGcg+8MFNTU0xPTzM6OsrOnTsZHR1lenqaqampfpcmaYP1FOAR8ZyI+EhEfCEiFiLi6qoK08oWFhYYHh5+Wtvw8DALCwt9qkhSv/S6B/4B4Ehmvhi4EjBF1tng4CBzc3NPa5ubm2NwcLBPFUnql64DPCKeDVwDTANk5g8y83tVFaaVTU5OMj4+zuzsLKdOnWJ2dpbx8XEmJyf7XZqkDdbLQcyfBb4F/E1EXAkcA96WmU9UUplWtHSgcu/evSwsLDA4OMjU1JQHMKVtKDKzuxUj6sD9wCsz81MR8QHg8cx897LlJoAJgFqtdtWhQ4d6LFlLms0mAwMD/S5DOot9s1qjo6PHMrO+vL2XAP8p4P7M3N16/SpgX2besNo69Xo95+fnu9qeztZoNBgZGel3GdJZ7JvViogVA7zrMfDM/B/gaxFxRatpD/D5bj9PktSZXi/k2Qt8OCJ+Avgv4E29lyRJWoueAjwzPwuctVsvSVp/XokpSYXq+iBmVxuL+BbwlQ3b4NZ3MfDtfhchrcC+Wa3LMvOS5Y0bGuCqVkTMr3RkWuo3++bGcAhFkgplgEtSoQzwsu3vdwHSKuybG8AxcEkqlHvgklQoA3wTiohmv2uQVhMRuyPieL/rkAEuScUywDe5iPi9iPiPiHgwIt5zRvvHIuJYRDzUmrJ3qb0ZEVMR8UBE3B8Rtf5Urq0iIn43Io63Hm9vNe+MiIOtfvmRiLigtexVEfHJVt/8eERc2mq/NSI+31r+UKttV0QcaPXvz0TETX36FcuVmT422QNotn5ex+LR/GDxf7Z3A9e03nte6+f5wHHgotbrBH659fyPgD/o9+/jo9wHcBXwOWAXMAA8BLy81c9e2VrmAPAO4Dzg34BLWu2/DhxoPX8U+MnW8+e0fr4PeONSG/AlYFe/f+eSHr3ORqj1dV3r8ZnW6wHgcuAocGtE3Nxqf2Gr/X+BH7AY9LB4l6RrN6xabUXDwF3ZutNWRHwUeBXwtcz819YyHwJuBY4AQ8C9EQGwA/hGa5kHWZy59GPAx1pt1wGvjYh3tF4/E3gR3lt3zQzwzS2AP8zMDz6tMWIEeDVwdWY+GRENFjs/wMls7dIAp/G/sXoTq7QvP/84W8s+lJlXr7D8DSzeQ/e1wLsj4qWt5X8lM79YVbHbjWPgm9vHgTdHxABARLwgIp4PXAh8txXeLwZe0c8itaUdBV4XERdExC7gZuBfgBdFxFJQjwFzwBeBS5baI+K8iHhpRDwDeGFmzgK/z+JwyQCL/XtvtHbXI+LlG/mLbQXunW1imXlPRAwC97X6eBN4I4t/qr4lIh5k8R/N/f2rUltZZn46Iu4E/r3V9NfAd1kc5rglIj4IfBn4y8z8QUT8KvDnEXEhi/nyZyyObX+o1RbAn2bm9yLiva33H2yF+MPAjRv325XPKzElqVAOoUhSoQxwSSqUAS5JhTLAJalQBrgkFcoAl6RCGeCSVCgDXJIK9f87apAiomTqswAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Calculate the mean, ...\n",
    "print(df.mean())\n",
    "\n",
    "# ..., show a boxplot, ...\n",
    "# Note that by using data-frames, the display can get automatically labelled correctly\n",
    "# - also for the next plot\n",
    "df.boxplot(return_type='axes')\n",
    "\n",
    "# ... and find the p-value:\n",
    "t, pVal = stats.ttest_ind(lean, obese)\n",
    "if pVal < 0.05:\n",
    "    print('There is a significant difference: p = {0}'.format(pVal))"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
